EDA - Home Credit¶

Problem Statement:¶

Conduct a comprehensive analysis of a dataset containing various financial and demographic attributes of loan applicants. Our goal is to gain insights into the factors influencing loan default rates and to develop strategies to mitigate risks associated with lending.

Objectives:¶

Task 1 - Importing necessary Modules:

  • Import the modules necessary for Data Manipulation and Visualization.

Task 2 - Reading dataset:

  • Read the dataset containing loan applicant information.

Task 3 - Exploring the Dataset:

  • Understand the Structure and various datatypes of the attributes within the dataset.

Task 4 - Missing value analysis:

  • Identify and analyze missing values in the dataset.

Task 5 - Analysing categorical and numerical columns:

  • Analyze categorical and numerical columns to understand the statistical properties and relationships within the dataset.

Task 6 - Univariate Analysis:

  • Conduct univariate analysis to explore the distribution and characteristics of individual variables.

Task 7 - Outliers:

  • Identify and analyze outliers within the dataset to understand their impact on the analysis.

Task 8 - Merging Datasets:

  • Identify and merge different Datasets for further analysis.

Task 9 - Bivariate analysis:

  • Conduct bivariate analysis to explore relationships between different variables and their impact on loan default rates.

Task 1 - Importing Necessary Modules¶

Description¶

In this task, you will import all the necessary modules and packages required for performing various operations in the project.

In [7]:
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical computation
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For advanced statistical visualizations
sns.set(color_codes=True)  # Set seaborn color palette to default
In [8]:
import warnings

# Filter out warnings to ignore them
warnings.filterwarnings('ignore')

Task 2 - Reading the Dataset¶

In [14]:
# Read the CSV file 'application_data.csv' into a DataFrame
application_train = pd.read_csv('application_data.csv')

Task 3 - Exploring the Data¶

In [17]:
# Display the first few rows of the DataFrame to inspect its structure and contents
application_train.head()
Out[17]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

In [19]:
# Get the dimensions of the DataFrame (number of rows and columns)
application_train.shape
Out[19]:
(211607, 122)
In [21]:
# Get concise summary information about the DataFrame, including column data types and non-null counts
application_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211607 entries, 0 to 211606
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(101), int64(5), object(16)
memory usage: 197.0+ MB
In [23]:
application_train.iloc[:,:100].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211607 entries, 0 to 211606
Data columns (total 100 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   SK_ID_CURR                    211607 non-null  int64  
 1   TARGET                        211607 non-null  int64  
 2   NAME_CONTRACT_TYPE            211607 non-null  object 
 3   CODE_GENDER                   211607 non-null  object 
 4   FLAG_OWN_CAR                  211607 non-null  object 
 5   FLAG_OWN_REALTY               211607 non-null  object 
 6   CNT_CHILDREN                  211607 non-null  int64  
 7   AMT_INCOME_TOTAL              211607 non-null  float64
 8   AMT_CREDIT                    211607 non-null  float64
 9   AMT_ANNUITY                   211598 non-null  float64
 10  AMT_GOODS_PRICE               211434 non-null  float64
 11  NAME_TYPE_SUITE               210739 non-null  object 
 12  NAME_INCOME_TYPE              211607 non-null  object 
 13  NAME_EDUCATION_TYPE           211607 non-null  object 
 14  NAME_FAMILY_STATUS            211607 non-null  object 
 15  NAME_HOUSING_TYPE             211607 non-null  object 
 16  REGION_POPULATION_RELATIVE    211607 non-null  float64
 17  DAYS_BIRTH                    211607 non-null  int64  
 18  DAYS_EMPLOYED                 211607 non-null  int64  
 19  DAYS_REGISTRATION             211607 non-null  float64
 20  DAYS_ID_PUBLISH               211606 non-null  float64
 21  OWN_CAR_AGE                   71891 non-null   float64
 22  FLAG_MOBIL                    211606 non-null  float64
 23  FLAG_EMP_PHONE                211606 non-null  float64
 24  FLAG_WORK_PHONE               211606 non-null  float64
 25  FLAG_CONT_MOBILE              211606 non-null  float64
 26  FLAG_PHONE                    211606 non-null  float64
 27  FLAG_EMAIL                    211606 non-null  float64
 28  OCCUPATION_TYPE               145141 non-null  object 
 29  CNT_FAM_MEMBERS               211604 non-null  float64
 30  REGION_RATING_CLIENT          211606 non-null  float64
 31  REGION_RATING_CLIENT_W_CITY   211606 non-null  float64
 32  WEEKDAY_APPR_PROCESS_START    211606 non-null  object 
 33  HOUR_APPR_PROCESS_START       211606 non-null  float64
 34  REG_REGION_NOT_LIVE_REGION    211606 non-null  float64
 35  REG_REGION_NOT_WORK_REGION    211606 non-null  float64
 36  LIVE_REGION_NOT_WORK_REGION   211606 non-null  float64
 37  REG_CITY_NOT_LIVE_CITY        211606 non-null  float64
 38  REG_CITY_NOT_WORK_CITY        211606 non-null  float64
 39  LIVE_CITY_NOT_WORK_CITY       211606 non-null  float64
 40  ORGANIZATION_TYPE             211606 non-null  object 
 41  EXT_SOURCE_1                  92257 non-null   float64
 42  EXT_SOURCE_2                  211138 non-null  float64
 43  EXT_SOURCE_3                  169578 non-null  float64
 44  APARTMENTS_AVG                104102 non-null  float64
 45  BASEMENTAREA_AVG              87640 non-null   float64
 46  YEARS_BEGINEXPLUATATION_AVG   108328 non-null  float64
 47  YEARS_BUILD_AVG               70860 non-null   float64
 48  COMMONAREA_AVG                63648 non-null   float64
 49  ELEVATORS_AVG                 98733 non-null   float64
 50  ENTRANCES_AVG                 104974 non-null  float64
 51  FLOORSMAX_AVG                 106237 non-null  float64
 52  FLOORSMIN_AVG                 67976 non-null   float64
 53  LANDAREA_AVG                  85853 non-null   float64
 54  LIVINGAPARTMENTS_AVG          66902 non-null   float64
 55  LIVINGAREA_AVG                105325 non-null  float64
 56  NONLIVINGAPARTMENTS_AVG       64662 non-null   float64
 57  NONLIVINGAREA_AVG             94778 non-null   float64
 58  APARTMENTS_MODE               104102 non-null  float64
 59  BASEMENTAREA_MODE             87640 non-null   float64
 60  YEARS_BEGINEXPLUATATION_MODE  108328 non-null  float64
 61  YEARS_BUILD_MODE              70860 non-null   float64
 62  COMMONAREA_MODE               63648 non-null   float64
 63  ELEVATORS_MODE                98733 non-null   float64
 64  ENTRANCES_MODE                104974 non-null  float64
 65  FLOORSMAX_MODE                106237 non-null  float64
 66  FLOORSMIN_MODE                67976 non-null   float64
 67  LANDAREA_MODE                 85853 non-null   float64
 68  LIVINGAPARTMENTS_MODE         66902 non-null   float64
 69  LIVINGAREA_MODE               105325 non-null  float64
 70  NONLIVINGAPARTMENTS_MODE      64662 non-null   float64
 71  NONLIVINGAREA_MODE            94778 non-null   float64
 72  APARTMENTS_MEDI               104102 non-null  float64
 73  BASEMENTAREA_MEDI             87640 non-null   float64
 74  YEARS_BEGINEXPLUATATION_MEDI  108328 non-null  float64
 75  YEARS_BUILD_MEDI              70860 non-null   float64
 76  COMMONAREA_MEDI               63648 non-null   float64
 77  ELEVATORS_MEDI                98733 non-null   float64
 78  ENTRANCES_MEDI                104974 non-null  float64
 79  FLOORSMAX_MEDI                106237 non-null  float64
 80  FLOORSMIN_MEDI                67976 non-null   float64
 81  LANDAREA_MEDI                 85853 non-null   float64
 82  LIVINGAPARTMENTS_MEDI         66902 non-null   float64
 83  LIVINGAREA_MEDI               105325 non-null  float64
 84  NONLIVINGAPARTMENTS_MEDI      64662 non-null   float64
 85  NONLIVINGAREA_MEDI            94778 non-null   float64
 86  FONDKAPREMONT_MODE            66867 non-null   object 
 87  HOUSETYPE_MODE                105311 non-null  object 
 88  TOTALAREA_MODE                109399 non-null  float64
 89  WALLSMATERIAL_MODE            103942 non-null  object 
 90  EMERGENCYSTATE_MODE           111260 non-null  object 
 91  OBS_30_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 92  DEF_30_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 93  OBS_60_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 94  DEF_60_CNT_SOCIAL_CIRCLE      210910 non-null  float64
 95  DAYS_LAST_PHONE_CHANGE        211605 non-null  float64
 96  FLAG_DOCUMENT_2               211606 non-null  float64
 97  FLAG_DOCUMENT_3               211606 non-null  float64
 98  FLAG_DOCUMENT_4               211606 non-null  float64
 99  FLAG_DOCUMENT_5               211606 non-null  float64
dtypes: float64(79), int64(5), object(16)
memory usage: 161.4+ MB
In [24]:
application_train.iloc[:,100:122].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211607 entries, 0 to 211606
Data columns (total 22 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   FLAG_DOCUMENT_6             211606 non-null  float64
 1   FLAG_DOCUMENT_7             211606 non-null  float64
 2   FLAG_DOCUMENT_8             211606 non-null  float64
 3   FLAG_DOCUMENT_9             211606 non-null  float64
 4   FLAG_DOCUMENT_10            211606 non-null  float64
 5   FLAG_DOCUMENT_11            211606 non-null  float64
 6   FLAG_DOCUMENT_12            211606 non-null  float64
 7   FLAG_DOCUMENT_13            211606 non-null  float64
 8   FLAG_DOCUMENT_14            211606 non-null  float64
 9   FLAG_DOCUMENT_15            211606 non-null  float64
 10  FLAG_DOCUMENT_16            211606 non-null  float64
 11  FLAG_DOCUMENT_17            211606 non-null  float64
 12  FLAG_DOCUMENT_18            211606 non-null  float64
 13  FLAG_DOCUMENT_19            211606 non-null  float64
 14  FLAG_DOCUMENT_20            211606 non-null  float64
 15  FLAG_DOCUMENT_21            211606 non-null  float64
 16  AMT_REQ_CREDIT_BUREAU_HOUR  183034 non-null  float64
 17  AMT_REQ_CREDIT_BUREAU_DAY   183034 non-null  float64
 18  AMT_REQ_CREDIT_BUREAU_WEEK  183034 non-null  float64
 19  AMT_REQ_CREDIT_BUREAU_MON   183034 non-null  float64
 20  AMT_REQ_CREDIT_BUREAU_QRT   183034 non-null  float64
 21  AMT_REQ_CREDIT_BUREAU_YEAR  183034 non-null  float64
dtypes: float64(22)
memory usage: 35.5 MB

Task 4 - Missing Values Analysis¶

Calculating the Percentage of Missing values¶

In [29]:
# Calculate the proportion of non-missing values for each column
train_missing = application_train.count() / len(application_train)

# Convert the proportions to percentages and calculate the percentage of missing values for each column
train_missing = (1 - train_missing) * 100

# Sort the missing percentages in descending order and display the top 60 columns
train_missing.sort_values(ascending=False).head(60)
Out[29]:
COMMONAREA_AVG                  69.921600
COMMONAREA_MODE                 69.921600
COMMONAREA_MEDI                 69.921600
NONLIVINGAPARTMENTS_MEDI        69.442410
NONLIVINGAPARTMENTS_MODE        69.442410
NONLIVINGAPARTMENTS_AVG         69.442410
FONDKAPREMONT_MODE              68.400384
LIVINGAPARTMENTS_AVG            68.383844
LIVINGAPARTMENTS_MEDI           68.383844
LIVINGAPARTMENTS_MODE           68.383844
FLOORSMIN_MODE                  67.876299
FLOORSMIN_AVG                   67.876299
FLOORSMIN_MEDI                  67.876299
YEARS_BUILD_AVG                 66.513395
YEARS_BUILD_MODE                66.513395
YEARS_BUILD_MEDI                66.513395
OWN_CAR_AGE                     66.026171
LANDAREA_MEDI                   59.428091
LANDAREA_AVG                    59.428091
LANDAREA_MODE                   59.428091
BASEMENTAREA_MODE               58.583601
BASEMENTAREA_MEDI               58.583601
BASEMENTAREA_AVG                58.583601
EXT_SOURCE_1                    56.401726
NONLIVINGAREA_MODE              55.210366
NONLIVINGAREA_AVG               55.210366
NONLIVINGAREA_MEDI              55.210366
ELEVATORS_AVG                   53.341336
ELEVATORS_MEDI                  53.341336
ELEVATORS_MODE                  53.341336
WALLSMATERIAL_MODE              50.879697
APARTMENTS_AVG                  50.804085
APARTMENTS_MEDI                 50.804085
APARTMENTS_MODE                 50.804085
ENTRANCES_MODE                  50.392000
ENTRANCES_MEDI                  50.392000
ENTRANCES_AVG                   50.392000
HOUSETYPE_MODE                  50.232743
LIVINGAREA_MEDI                 50.226127
LIVINGAREA_MODE                 50.226127
LIVINGAREA_AVG                  50.226127
FLOORSMAX_MODE                  49.795139
FLOORSMAX_AVG                   49.795139
FLOORSMAX_MEDI                  49.795139
YEARS_BEGINEXPLUATATION_MODE    48.806987
YEARS_BEGINEXPLUATATION_MEDI    48.806987
YEARS_BEGINEXPLUATATION_AVG     48.806987
TOTALAREA_MODE                  48.300860
EMERGENCYSTATE_MODE             47.421399
OCCUPATION_TYPE                 31.410114
EXT_SOURCE_3                    19.861819
AMT_REQ_CREDIT_BUREAU_HOUR      13.502861
AMT_REQ_CREDIT_BUREAU_WEEK      13.502861
AMT_REQ_CREDIT_BUREAU_MON       13.502861
AMT_REQ_CREDIT_BUREAU_YEAR      13.502861
AMT_REQ_CREDIT_BUREAU_DAY       13.502861
AMT_REQ_CREDIT_BUREAU_QRT       13.502861
NAME_TYPE_SUITE                  0.410194
DEF_60_CNT_SOCIAL_CIRCLE         0.329384
OBS_60_CNT_SOCIAL_CIRCLE         0.329384
dtype: float64

Missing value treatment¶

As you can observe, many columns contain missing values, and several are missing around 50% or more of their entries, while the rest have far fewer. The columns with roughly 50% or more missing data are mostly statistical aggregates (AVG, MODE, MEDI) of housing attributes, for which there is no sensible way to impute the missing values. We will therefore exclude these columns from the analysis and work with the remaining ones. Let's analyse those columns.
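The drop-by-threshold step can also be written compactly with `isnull().mean()`; a minimal sketch on a toy DataFrame (column names are illustrative, not from the real dataset):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for application_train: one column is >50% missing
df = pd.DataFrame({
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],   # 75% missing
    'mostly_present': [1.0, 2.0, np.nan, 4.0],         # 25% missing
})

missing_pct = df.isnull().mean() * 100     # percent missing per column
kept = df.loc[:, missing_pct < 50]         # keep columns below the threshold

print(list(kept.columns))  # ['mostly_present']
```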

Removing the columns with more than 50% missing values¶

In [34]:
# Filter the DataFrame 'application_train' to include only columns with less than 50% missing values
train = application_train.loc[:, train_missing < 50]
In [36]:
# Calculate the proportion of non-missing values for each column
train_missing_2 = train.count() / len(train)

# Convert the proportions to percentages and calculate the percentage of missing values for each column
train_missing_2 = (1 - train_missing_2) * 100

# Sort the missing percentages in descending order
train_missing_2[train_missing_2 > 0].sort_values(ascending=False)
Out[36]:
FLOORSMAX_MEDI                  49.795139
FLOORSMAX_AVG                   49.795139
FLOORSMAX_MODE                  49.795139
YEARS_BEGINEXPLUATATION_AVG     48.806987
YEARS_BEGINEXPLUATATION_MEDI    48.806987
                                  ...    
FLAG_DOCUMENT_16                 0.000473
FLAG_DOCUMENT_15                 0.000473
FLAG_DOCUMENT_14                 0.000473
FLAG_DOCUMENT_20                 0.000473
FLAG_DOCUMENT_21                 0.000473
Length: 64, dtype: float64
In [38]:
# OCCUPATION_TYPE                 31.345545
# EXT_SOURCE_3                    19.825307
# AMT_REQ_CREDIT_BUREAU_HOUR      13.501631
# AMT_REQ_CREDIT_BUREAU_QRT       13.501631
# AMT_REQ_CREDIT_BUREAU_MON       13.501631
# AMT_REQ_CREDIT_BUREAU_WEEK      13.501631
# AMT_REQ_CREDIT_BUREAU_DAY       13.501631
# AMT_REQ_CREDIT_BUREAU_YEAR      13.501631
# NAME_TYPE_SUITE                  0.420148
# DEF_30_CNT_SOCIAL_CIRCLE         0.332021
# OBS_60_CNT_SOCIAL_CIRCLE         0.332021
# DEF_60_CNT_SOCIAL_CIRCLE         0.332021
# OBS_30_CNT_SOCIAL_CIRCLE         0.332021
# EXT_SOURCE_2                     0.214626
# AMT_GOODS_PRICE                  0.090403
# AMT_ANNUITY                      0.003902
# CNT_FAM_MEMBERS                  0.000650
# DAYS_LAST_PHONE_CHANGE           0.000325

Let's analyse these columns one by one.

Column: OCCUPATION_TYPE¶

The OCCUPATION_TYPE column has a large amount of missing data, about 31%. Although this might suggest removing the column, on closer inspection it carries important information, so we have chosen to keep it and leave its missing values untreated.

In [43]:
# Display the first few rows of the 'OCCUPATION_TYPE' column in the DataFrame 'train'
train['OCCUPATION_TYPE'].head()
Out[43]:
0      Laborers
1    Core staff
2      Laborers
3      Laborers
4    Core staff
Name: OCCUPATION_TYPE, dtype: object
In [45]:
# Plot the count of each occupation category
sns.countplot(x='OCCUPATION_TYPE', data=train)
plt.xticks(rotation=90)
Out[45]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
 [Text(0, 0, 'Laborers'),
  Text(1, 0, 'Core staff'),
  Text(2, 0, 'Accountants'),
  Text(3, 0, 'Managers'),
  Text(4, 0, 'Drivers'),
  Text(5, 0, 'Sales staff'),
  Text(6, 0, 'Cleaning staff'),
  Text(7, 0, 'Cooking staff'),
  Text(8, 0, 'Private service staff'),
  Text(9, 0, 'Medicine staff'),
  Text(10, 0, 'Security staff'),
  Text(11, 0, 'High skill tech staff'),
  Text(12, 0, 'Waiters/barmen staff'),
  Text(13, 0, 'Low-skill Laborers'),
  Text(14, 0, 'Realty agents'),
  Text(15, 0, 'Secretaries'),
  Text(16, 0, 'IT staff'),
  Text(17, 0, 'HR staff')])
[Image: countplot of OCCUPATION_TYPE categories]

This column exhibits a significant proportion of missing values, about 31%. Since it is categorical, the only straightforward imputation would be the mode, but replacing all these missing values with the "Laborers" category would not be appropriate. We have therefore decided not to perform any missing value treatment on this column and to leave it as is.
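If a treatment were ever needed, a common alternative (not applied in this notebook) is to make missingness its own explicit category rather than imputing the mode; a sketch on a hypothetical series:

```python
import pandas as pd

# Hypothetical stand-in for the OCCUPATION_TYPE column
occupation = pd.Series(['Laborers', None, 'Core staff', None], dtype=object)

# Instead of imputing the mode ('Laborers'), flag missing rows explicitly
filled = occupation.fillna('Unknown')

print(filled.tolist())  # ['Laborers', 'Unknown', 'Core staff', 'Unknown']
```

This preserves the signal that the occupation was not reported, which mode imputation would erase.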

Column: EXT_SOURCE_3¶

In [49]:
# Display the first 10 rows of the 'EXT_SOURCE_3' column in the DataFrame 'train'
train['EXT_SOURCE_3'].head(10)
Out[49]:
0    0.139376
1         NaN
2    0.729567
3         NaN
4         NaN
5    0.621226
6    0.492060
7    0.540654
8    0.751724
9         NaN
Name: EXT_SOURCE_3, dtype: float64
In [51]:
# Plotting the distribution
train['EXT_SOURCE_3'].plot.hist()
Out[51]:
<Axes: ylabel='Frequency'>
[Image: histogram of EXT_SOURCE_3]
In [53]:
# Calculate the mean of the 'EXT_SOURCE_3' column in the DataFrame 'train'
train['EXT_SOURCE_3'].mean()
Out[53]:
np.float64(0.5107185957866892)
In [55]:
# Calculate the mode (most frequent value) of the 'EXT_SOURCE_3' column in the DataFrame 'train'
train['EXT_SOURCE_3'].mode()
Out[55]:
0    0.7463
Name: EXT_SOURCE_3, dtype: float64

The data follows a skewed, roughly normal distribution.
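For a skewed numeric column like this, the median is generally a safer imputation choice than the mean; a minimal sketch on toy values (not applied to `train` here):

```python
import pandas as pd
import numpy as np

# Toy values standing in for a skewed column such as EXT_SOURCE_3
s = pd.Series([0.14, np.nan, 0.73, np.nan, 0.62, 0.49])

median = s.median()          # robust to skew, unlike the mean
imputed = s.fillna(median)

print(median)  # 0.555
```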

Column: AMT_REQ_CREDIT_BUREAU_QRT¶

In [59]:
# Display the first few rows of the 'AMT_REQ_CREDIT_BUREAU_QRT' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_QRT'].head()
Out[59]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64
In [61]:
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_QRT' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_QRT.unique())

# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_QRT' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_QRT'].plot.hist(bins=num_unique_values)
Out[61]:
<Axes: ylabel='Frequency'>
[Image: histogram of AMT_REQ_CREDIT_BUREAU_QRT]
In [63]:
# Count the occurrences of each unique value in the 'AMT_REQ_CREDIT_BUREAU_QRT' column
train.AMT_REQ_CREDIT_BUREAU_QRT.value_counts()
Out[63]:
AMT_REQ_CREDIT_BUREAU_QRT
0.0    148192
1.0     23367
2.0      9889
3.0      1190
4.0       316
5.0        46
6.0        23
7.0         6
8.0         5
Name: count, dtype: int64
In [65]:
# Calculate the mean (average) of the values in the 'AMT_REQ_CREDIT_BUREAU_QRT' column
train['AMT_REQ_CREDIT_BUREAU_QRT'].mean()
Out[65]:
np.float64(0.2645901854300294)
In [67]:
# Calculate the mode (most frequent value) of the 'AMT_REQ_CREDIT_BUREAU_QRT' column
train['AMT_REQ_CREDIT_BUREAU_QRT'].mode()
Out[67]:
0    0.0
Name: AMT_REQ_CREDIT_BUREAU_QRT, dtype: float64

Since the column takes only discrete values (0, 1, 2, and so on), we cannot impute with the mean, which would produce a fractional value. Here, we would replace missing values with the mode instead.
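Mode imputation for such a discrete count column can be sketched as follows, on toy values rather than the real column:

```python
import pandas as pd
import numpy as np

# Toy stand-in for a discrete count column like AMT_REQ_CREDIT_BUREAU_QRT
qrt = pd.Series([0.0, 0.0, 1.0, np.nan, 0.0, 2.0, np.nan])

mode_value = qrt.mode()[0]          # most frequent value
qrt_filled = qrt.fillna(mode_value)

print(mode_value)  # 0.0
```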

Column: AMT_REQ_CREDIT_BUREAU_YEAR¶

In [71]:
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_YEAR' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_YEAR'].head(10)
Out[71]:
0    1.0
1    0.0
2    0.0
3    NaN
4    0.0
5    1.0
6    2.0
7    0.0
8    1.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64
In [73]:
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_YEAR' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_YEAR.unique())

# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_YEAR' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_YEAR'].plot.hist(bins=num_unique_values)
Out[73]:
<Axes: ylabel='Frequency'>
[Image: histogram of AMT_REQ_CREDIT_BUREAU_YEAR]

As you can see, the values are all integers (0, 1, 2, and so on). If we impute, we should use the mode rather than the mean, because the mean would produce a fractional value.

In [76]:
# Calculate the mean (average) of the values in the 'AMT_REQ_CREDIT_BUREAU_YEAR' column
train['AMT_REQ_CREDIT_BUREAU_YEAR'].mean()
Out[76]:
np.float64(1.8944840849241125)
In [78]:
# Calculate the mode (most frequent value) of the 'AMT_REQ_CREDIT_BUREAU_YEAR' column
train['AMT_REQ_CREDIT_BUREAU_YEAR'].mode()
Out[78]:
0    0.0
Name: AMT_REQ_CREDIT_BUREAU_YEAR, dtype: float64

Column: AMT_REQ_CREDIT_BUREAU_WEEK¶

In [81]:
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_WEEK' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_WEEK'].head(10)
Out[81]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_WEEK, dtype: float64
In [83]:
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_WEEK' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_WEEK.unique())

# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_WEEK' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_WEEK'].plot.hist(bins=num_unique_values)
Out[83]:
<Axes: ylabel='Frequency'>
[Image: histogram of AMT_REQ_CREDIT_BUREAU_WEEK]

As you can see, the values are mostly 0, so this column acts almost as a constant with little variation. We can therefore ignore it in the analysis.
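A quick way to flag such near-constant columns programmatically is to check the share of the single most frequent value; a sketch on toy data (the 0.95 threshold is an illustrative choice):

```python
import pandas as pd

# Toy frame: one near-constant column, one varied column
df = pd.DataFrame({
    'almost_constant': [0.0] * 98 + [1.0, 2.0],   # 98% zeros
    'varied': list(range(100)),                   # all distinct values
})

# Share of the single most frequent value in each column
top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])

near_constant = top_share[top_share > 0.95].index.tolist()
print(near_constant)  # ['almost_constant']
```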

Column: AMT_REQ_CREDIT_BUREAU_MON¶

In [87]:
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_MON' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_MON'].head(10)
Out[87]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    1.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_MON, dtype: float64
In [89]:
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_MON' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_MON.unique())

# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_MON' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_MON'].plot.hist(bins=num_unique_values)
Out[89]:
<Axes: ylabel='Frequency'>
[Image: histogram of AMT_REQ_CREDIT_BUREAU_MON]

Again the same observation: the values are mostly 0, so the column acts almost as a constant and can be ignored.

Column: AMT_REQ_CREDIT_BUREAU_DAY¶

In [93]:
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_DAY' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_DAY'].head(10)
Out[93]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_DAY, dtype: float64
In [95]:
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_DAY' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_DAY.unique())

# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_DAY' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_DAY'].plot.hist(bins=num_unique_values)
Out[95]:
<Axes: ylabel='Frequency'>
[Image: histogram of AMT_REQ_CREDIT_BUREAU_DAY]
In [97]:
# Count the occurrences of each unique value in the 'AMT_REQ_CREDIT_BUREAU_DAY' column
train.AMT_REQ_CREDIT_BUREAU_DAY.value_counts()
Out[97]:
AMT_REQ_CREDIT_BUREAU_DAY
0.0    181983
1.0       908
2.0        75
3.0        29
4.0        23
5.0         7
6.0         6
9.0         2
8.0         1
Name: count, dtype: int64

As you can see, nearly all the values are 0, so we can ignore this column as well.

Column: AMT_REQ_CREDIT_BUREAU_HOUR¶

In [101]:
# Display the first 10 rows of the 'AMT_REQ_CREDIT_BUREAU_HOUR' column in the DataFrame 'train'
train['AMT_REQ_CREDIT_BUREAU_HOUR'].head(10)
Out[101]:
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
5    0.0
6    0.0
7    0.0
8    0.0
9    NaN
Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
In [103]:
# Calculate the number of unique values in the 'AMT_REQ_CREDIT_BUREAU_HOUR' column
num_unique_values = len(train.AMT_REQ_CREDIT_BUREAU_HOUR.unique())

# Plot a histogram of the 'AMT_REQ_CREDIT_BUREAU_HOUR' column with bins equal to the number of unique values
train['AMT_REQ_CREDIT_BUREAU_HOUR'].plot.hist(bins=num_unique_values)
Out[103]:
<Axes: ylabel='Frequency'>
[Image: histogram of AMT_REQ_CREDIT_BUREAU_HOUR]
In [104]:
# Count the occurrences of each unique value in the 'AMT_REQ_CREDIT_BUREAU_HOUR' column
train.AMT_REQ_CREDIT_BUREAU_HOUR.value_counts()
Out[104]:
AMT_REQ_CREDIT_BUREAU_HOUR
0.0    181908
1.0      1071
2.0        46
3.0         8
4.0         1
Name: count, dtype: int64

Again, we can ignore this column.

In [108]:
# Get the names of all columns in the DataFrame 'train'
train.columns.values
Out[108]:
array(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'FLAG_MOBIL',
       'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE',
       'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS',
       'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY',
       'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
       'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
       'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
       'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
       'ORGANIZATION_TYPE', 'EXT_SOURCE_2', 'EXT_SOURCE_3',
       'YEARS_BEGINEXPLUATATION_AVG', 'FLOORSMAX_AVG',
       'YEARS_BEGINEXPLUATATION_MODE', 'FLOORSMAX_MODE',
       'YEARS_BEGINEXPLUATATION_MEDI', 'FLOORSMAX_MEDI', 'TOTALAREA_MODE',
       'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE',
       'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE',
       'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE',
       'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
       'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
       'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
       'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
       'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
       'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
       'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21',
       'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
       'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
       'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype=object)

Column: DAYS_BIRTH¶

This column represents the client's age in days relative to the application date, stored as negative values. For clarity, we convert these values into years: divide by 365, round, and negate to make them positive. We apply the same transformation to the other day-based columns.

In [112]:
# Display the first few rows of the 'DAYS_BIRTH' column in the DataFrame 'train'
train['DAYS_BIRTH'].head()
Out[112]:
0    -9461
1   -16765
2   -19046
3   -19005
4   -19932
Name: DAYS_BIRTH, dtype: int64
In [114]:
# Convert the values in the 'DAYS_BIRTH' column from days to years and display the first few rows
train['DAYS_BIRTH'] = -round(train['DAYS_BIRTH'] / 365, 0)
train['DAYS_BIRTH'].head()
Out[114]:
0    26.0
1    46.0
2    52.0
3    52.0
4    55.0
Name: DAYS_BIRTH, dtype: float64
In [116]:
# Convert the values in the 'DAYS_REGISTRATION' column from days to years and display the first few rows
train['DAYS_REGISTRATION'] = -round(train['DAYS_REGISTRATION'] / 365, 0)
train['DAYS_REGISTRATION'].head()
Out[116]:
0    10.0
1     3.0
2    12.0
3    27.0
4    12.0
Name: DAYS_REGISTRATION, dtype: float64
In [118]:
# Convert the values in the 'DAYS_ID_PUBLISH' column from days to years and display the first few rows
train['DAYS_ID_PUBLISH'] = -round(train['DAYS_ID_PUBLISH'] / 365, 0)
train['DAYS_ID_PUBLISH'].head()
Out[118]:
0    6.0
1    1.0
2    7.0
3    7.0
4    9.0
Name: DAYS_ID_PUBLISH, dtype: float64
In [120]:
# Get the data types of each column in the DataFrame 'train'
train.dtypes
Out[120]:
SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
                               ...   
AMT_REQ_CREDIT_BUREAU_DAY     float64
AMT_REQ_CREDIT_BUREAU_WEEK    float64
AMT_REQ_CREDIT_BUREAU_MON     float64
AMT_REQ_CREDIT_BUREAU_QRT     float64
AMT_REQ_CREDIT_BUREAU_YEAR    float64
Length: 81, dtype: object

Target¶

Total users¶

In [124]:
# Count the number of non-null values in the 'TARGET' column of the DataFrame 'train'
train['TARGET'].count()
Out[124]:
np.int64(211607)

Client with payment difficulties¶

In [127]:
# Count the number of records where the value in the 'TARGET' column is equal to 1
(train['TARGET'] == 1).sum()
Out[127]:
np.int64(17206)

Other clients¶

In [130]:
# Count the number of records where the value in the 'TARGET' column is equal to 0
(train['TARGET'] == 0).sum()
Out[130]:
np.int64(194401)

Ratio¶

In [133]:
# Calculate the ratio of records with 'TARGET' value equal to 0 to records with 'TARGET' value equal to 1
(train['TARGET'] == 0).sum() / (train['TARGET'] == 1).sum()
Out[133]:
np.float64(11.298442403812624)

The data is highly imbalanced: roughly 11 clients without payment difficulties for every one with them.
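The imbalance is easier to read as class proportions; a sketch on a toy target with roughly the same 11:1 ratio:

```python
import pandas as pd

# Toy target mirroring the ~11:1 ratio seen above
target = pd.Series([0] * 11 + [1])

# Fraction of records in each class
shares = target.value_counts(normalize=True)
print(shares.to_dict())
```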

Task 5 - Analysing Categorical and Numerical Data¶

Analysing categorical data¶

Because of this imbalance, we will split the data into records with TARGET equal to 0 and records with TARGET equal to 1, analyse each subset individually, and look for any relationships.

In [138]:
# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to 0
train_0 = train.loc[train['TARGET'] == 0]

# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to 1
train_1 = train.loc[train['TARGET'] == 1]

We utilized three distinct plots for analysis:¶

Pie plot:

  • This displays all the values present in a column as percentages, where the sum of these percentages equals 100.

Countplot:

  • Here, we depict the count of each category, split by TARGET. Because of the class imbalance, the Target=0 bars are typically much taller than the Target=1 bars.

Barplot:

  • To create this plot, we first split the dataset into two subsets by the target variable (Target=0 and Target=1), computed the share of each category within each subset, and plotted those shares side by side. Because each subset is normalized to 100%, the bars for Target=0 and Target=1 are directly comparable despite the large difference in raw counts; notably, for many columns the percentages are broadly similar across the two groups. For a deeper understanding, please refer to the code of this plot.
In [141]:
def plotting(train, train0, train1, column):
    """
    Plots three types of visualizations for a given column in the dataset:
    a pie chart of overall distribution, a countplot by category, and a bar plot of percentage distribution by target variable.

    Parameters:
    - train: DataFrame containing the entire dataset.
    - train0: DataFrame filtered by the target variable with value 0.
    - train1: DataFrame filtered by the target variable with value 1.
    - column: The name of the column to be visualized.
    """

    # Local aliases for the subset DataFrames and the column name
    train_0 = train0
    train_1 = train1
    col = column

    # Initialize figure with a specific size
    fig = plt.figure(figsize=(13,10))

    # Create a subplot for the pie chart
    ax1 = plt.subplot(221)
    # Plotting pie chart for overall distribution of the column
    train[col].value_counts().plot.pie(autopct="%1.0f%%", ax=ax1)
    plt.title('Plotting data for the column: ' + column)

    # Create a subplot for the countplot
    ax2 = plt.subplot(222)
    # Plotting count plot by category with hue as TARGET
    sns.countplot(x=column, hue='TARGET', data=train, ax=ax2)
    plt.xticks(rotation=90)
    plt.title('Plotting data for target in terms of total count')

    # Create a subplot for the bar plot
    ax3 = plt.subplot(223)
    # Preparing data for percentage distribution by target variable
    df = pd.DataFrame()
    df['0'] = ((train_0[col].value_counts())/len(train_0))
    df['1'] = ((train_1[col].value_counts())/len(train_1))
    # Plotting bar plot for percentage distribution
    df.plot.bar(ax=ax3)
    plt.title('Plotting data for target in terms of percentage')

    # Adjust layout to prevent overlap
    fig.tight_layout()

    # Display the plots
    plt.show()
In [143]:
# Create a list to get all the Categorical columns
train_categorical = train.select_dtypes(include=['object']).columns
In [145]:
train_categorical
Out[145]:
Index(['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
       'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
       'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE',
       'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE',
       'EMERGENCYSTATE_MODE'],
      dtype='object')

Task 6.1 - Univariate Analysis of the Categorical data¶

Visualizing Data Distribution¶

Take, for instance, the column "NAME_CONTRACT_TYPE". The visualizations produced offer insightful data distributions:

Pie Chart (First Plot):¶

  • It reveals that 90% of the entries are 'Cash loans', while the remaining 10% are 'Revolving loans'. This chart provides a straightforward view of the overall category proportions within the column.

Count Plot (Second Plot):¶

  • This visualization differentiates between 'Cash loans' and 'Revolving loans' based on the TARGET variable. For TARGET=0 (no default), 'Cash loans' vastly outnumber 'Revolving loans', and the plot offers the same breakdown for TARGET=1 (default), enabling a comparison of counts between the two TARGET categories.

Percentage Bar Plot (Third Plot):¶

  • In this plot, the data is presented as percentages, focusing on the relative distribution within each TARGET category. The bars for TARGET=0 (blue) and TARGET=1 (orange) are plotted side by side, and within each TARGET group the category percentages sum to 100%. This plot emphasizes the proportionate representation of each loan type within the groups defined by the TARGET variable.
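The percentage table behind the third plot can be reproduced on toy data. This sketch uses hypothetical counts (not the real dataset) and mirrors the value_counts()/len() normalization used in the plotting function:

```python
import pandas as pd

# Hypothetical subsets: 9 cash / 1 revolving for non-defaulters, 3 / 1 for defaulters
train_0 = pd.DataFrame({'NAME_CONTRACT_TYPE': ['Cash loans'] * 9 + ['Revolving loans']})
train_1 = pd.DataFrame({'NAME_CONTRACT_TYPE': ['Cash loans'] * 3 + ['Revolving loans']})

col = 'NAME_CONTRACT_TYPE'
df = pd.DataFrame()
df['0'] = train_0[col].value_counts() / len(train_0)  # category shares within TARGET=0
df['1'] = train_1[col].value_counts() / len(train_1)  # category shares within TARGET=1

print(df.sum())  # each column sums to 1.0, i.e. 100% per TARGET group
```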
In [149]:
# Convert 'TARGET' to a string data type for categorical processing.
train['TARGET'] = train['TARGET'].astype(str)

# Change 'NAME_CONTRACT_TYPE' to a categorical type for optimized storage and plotting.
train['NAME_CONTRACT_TYPE'] = train['NAME_CONTRACT_TYPE'].astype('category')

# Iterate through a list of categorical column names, creating plots for each.
for column in train_categorical:
    # Indicates the column currently being plotted.
    print("Plotting ", column)

    # Generate visualizations for the current column against the TARGET variable.
    plotting(train, train_0, train_1, column)

    # Prints a separator line for readability between plots of different columns.
    print('----------------------------------------------------------------------------------------------')
Plotting  NAME_CONTRACT_TYPE, CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN_REALTY, NAME_TYPE_SUITE, NAME_INCOME_TYPE, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, NAME_HOUSING_TYPE, OCCUPATION_TYPE, WEEKDAY_APPR_PROCESS_START, ORGANIZATION_TYPE, EMERGENCYSTATE_MODE
[Pie chart, countplot, and percentage bar plot rendered for each of the categorical columns above]

The analysis identifies several key variables that significantly influence loan behavior and default rates:¶

CODE_GENDER:¶
  • Although loans are less frequently taken by males, the data indicates that the proportion of male defaulters is notably higher compared to females. This suggests gender as a potential factor in assessing loan risk.
NAME_INCOME_TYPE:¶
  • Pensioners show a lower default rate, implying that despite a potentially fixed or lower income, they maintain a higher reliability in repaying loans.
NAME_EDUCATION_TYPE:¶
  • Loans are predominantly sought for secondary education, followed by higher education. However, the default rate among borrowers with secondary education is substantially higher than those with higher education credentials. This points to education level as a significant determinant of loan repayment capacity.
NAME_FAMILY_STATUS:¶
  • Married individuals are the most common applicants for loans and tend to have lower default rates. In contrast, singles and those in civil marriages exhibit higher default rates, highlighting the influence of marital status on loan repayment behavior.
OCCUPATION_TYPE:¶
  • Laborers and various staff categories are the most frequent loan applicants, yet managers and high-skilled tech staff prove to be the most reliable in terms of repayment. This underscores the role of occupation type in predicting loan default risk.
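The per-category default rates read off the plots above can also be computed directly with a groupby. A minimal sketch with made-up rows (TARGET kept numeric so its mean is the default rate):

```python
import pandas as pd

# Made-up rows mimicking the gender/default pattern described above
df = pd.DataFrame({
    'CODE_GENDER': ['M', 'F', 'M', 'F', 'F', 'M', 'F', 'F'],
    'TARGET':      [1,   0,   0,   0,   0,   1,   0,   1],
})

# The mean of a 0/1 TARGET within each group is that group's default rate
print(df.groupby('CODE_GENDER')['TARGET'].mean())
```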

Analyzing Numerical columns¶

Non-defaulter correlation¶

In [153]:
# Filter only numeric columns from train_0 for correlation analysis.
numeric_cols = train_0.select_dtypes(include=[np.number]).columns
train_0_numeric = train_0[numeric_cols]

# Handle missing values, if necessary, by filling them with the median of each column.
# train_0_numeric = train_0_numeric.fillna(train_0_numeric.median())

# Calculate the correlation matrix for the numeric columns.
corr = train_0_numeric.corr()

# Prepare a mask to hide the upper triangle of the heatmap, for readability.
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

# Plotting setup: define the figure size and style.
f, ax = plt.subplots(figsize=(11, 9))
with sns.axes_style("white"):
    # Create a heatmap with the correlation matrix, masked upper triangle, and specific aesthetic choices.
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
[Correlation heatmap (lower triangle) of the numeric columns for non-defaulters]

This approach first narrows the DataFrame down to its numeric columns, then computes the correlation matrix on that filtered DataFrame.

Finding the top 10 correlations¶

In [156]:
# Select only the numeric columns from the DataFrame
numeric_df = train_0.select_dtypes(include=[np.number])

# Optionally, handle missing values first, e.g. by forward/backward filling
# numeric_df = numeric_df.ffill().bfill()

# Now calculate the absolute correlation matrix for the numeric columns
correlation_0 = numeric_df.corr().abs()

# Unstack and sort the correlation pairs
correlation_0 = correlation_0.unstack().sort_values(kind="quicksort", ascending=False)

# Optionally, drop NA values if there were any missing correlations due to missing data
correlation_0 = correlation_0.dropna()

# Display or process your sorted correlation pairs
print(correlation_0)
FLAG_EMAIL                  FLAG_EMAIL                    1.000000
AMT_REQ_CREDIT_BUREAU_QRT   AMT_REQ_CREDIT_BUREAU_QRT     1.000000
AMT_REQ_CREDIT_BUREAU_MON   AMT_REQ_CREDIT_BUREAU_MON     1.000000
DAYS_ID_PUBLISH             DAYS_ID_PUBLISH               1.000000
FLOORSMAX_MEDI              FLOORSMAX_MEDI                1.000000
                                                            ...   
FLAG_DOCUMENT_2             FLAG_MOBIL                    0.000012
FLAG_DOCUMENT_9             AMT_REQ_CREDIT_BUREAU_WEEK    0.000010
AMT_REQ_CREDIT_BUREAU_WEEK  FLAG_DOCUMENT_9               0.000010
FLAG_DOCUMENT_12            FLAG_MOBIL                    0.000007
FLAG_MOBIL                  FLAG_DOCUMENT_12              0.000007
Length: 4457, dtype: float64
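The sorted list above starts with trivial 1.0 self-correlations (every column correlates perfectly with itself) and contains each pair twice. One common refinement is to keep only the upper triangle of the matrix, excluding the diagonal, before unstacking. A sketch on a toy frame (not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numeric columns
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
    'c': [5, 3, 1, 4, 2],
})

corr = df.corr().abs()
# Keep only the upper triangle; k=1 excludes the diagonal of self-correlations
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.unstack().dropna().sort_values(ascending=False)
print(top_pairs.head(10))
```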

Top correlations for the non-defaulters¶

  • EXT_SOURCE_1 - DAYS_BIRTH: 0.601210
  • DAYS_EMPLOYED - DAYS_BIRTH: 0.618048
  • AMT_CREDIT - AMT_ANNUITY: 0.771309
  • AMT_GOODS_PRICE - AMT_ANNUITY: 0.776686
  • LIVE_CITY_NOT_WORK_CITY - REG_CITY_NOT_WORK_CITY: 0.830381
  • LIVE_REGION_NOT_WORK_REGION - REG_REGION_NOT_WORK_REGION: 0.861861
  • CNT_FAM_MEMBERS - CNT_CHILDREN: 0.878571
  • REGION_RATING_CLIENT_W_CITY - REGION_RATING_CLIENT: 0.950149
  • AMT_CREDIT - AMT_GOODS_PRICE: 0.987250
  • DAYS_EMPLOYED - FLAG_EMP_PHONE: 0.999758

Top correlations for the defaulter¶

  • EXT_SOURCE_1 - DAYS_BIRTH: 0.570054
  • DAYS_EMPLOYED - DAYS_BIRTH: 0.575097
  • FLAG_EMP_PHONE - DAYS_BIRTH: 0.578519
  • AMT_CREDIT - AMT_ANNUITY: 0.752195
  • AMT_GOODS_PRICE - AMT_ANNUITY: 0.752699
  • REG_CITY_NOT_WORK_CITY - LIVE_CITY_NOT_WORK_CITY: 0.778540
  • REG_REGION_NOT_WORK_REGION - LIVE_REGION_NOT_WORK_REGION: 0.847885
  • CNT_CHILDREN - CNT_FAM_MEMBERS: 0.885484
  • REGION_RATING_CLIENT_W_CITY - REGION_RATING_CLIENT: 0.956637
  • AMT_CREDIT - AMT_GOODS_PRICE: 0.983103
  • DAYS_EMPLOYED - FLAG_EMP_PHONE: 0.999702

Defaulter correlation¶

In [160]:
# Step 1: Select only numeric columns from the DataFrame
numeric_train_1 = train_1.select_dtypes(include=[np.number])

# Step 2: Optionally handle missing values. For example, you could fill them with zeros
# numeric_train_1.fillna(0, inplace=True)

# Now you can safely calculate the correlation matrix
corr = numeric_train_1.corr()

# If you wish to continue with masking the upper triangle for plotting:
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
with sns.axes_style("white"):
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True)
[Correlation heatmap (lower triangle) of the numeric columns for defaulters]
In [161]:
# Select only numeric columns from 'train_1'
numeric_train_1 = train_1.select_dtypes(include=[np.number])

# Optional: Handle missing values. For example, you can fill them with the column mean
# numeric_train_1.fillna(numeric_train_1.mean(), inplace=True)

# Now, compute the absolute correlation matrix
correlation_1 = numeric_train_1.corr().abs()

# If you want to unstack and sort the correlation pairs
correlation_pairs = correlation_1.unstack().sort_values(kind="quicksort", ascending=False)

# Optional: Drop NaN values from the sorted pairs, if any exist
correlation_pairs = correlation_pairs.dropna()

# Mask the exact 1.0 self-correlations so only cross-column values remain in the matrix
correlation_1 = correlation_1[correlation_1 != 1.0]

print(correlation_1)
                            SK_ID_CURR  TARGET  CNT_CHILDREN  \
SK_ID_CURR                         NaN     NaN      0.004849   
TARGET                             NaN     NaN           NaN   
CNT_CHILDREN                  0.004849     NaN           NaN   
AMT_INCOME_TOTAL              0.011458     NaN      0.006184   
AMT_CREDIT                    0.005594     NaN      0.005252   
...                                ...     ...           ...   
AMT_REQ_CREDIT_BUREAU_DAY     0.013174     NaN      0.016030   
AMT_REQ_CREDIT_BUREAU_WEEK    0.007371     NaN      0.012206   
AMT_REQ_CREDIT_BUREAU_MON     0.013882     NaN      0.013042   
AMT_REQ_CREDIT_BUREAU_QRT     0.004590     NaN      0.016741   
AMT_REQ_CREDIT_BUREAU_YEAR    0.008739     NaN      0.037146   

                            AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  \
SK_ID_CURR                          0.011458    0.005594     0.001134   
TARGET                                   NaN         NaN          NaN   
CNT_CHILDREN                        0.006184    0.005252     0.039927   
AMT_INCOME_TOTAL                         NaN    0.031498     0.037984   
AMT_CREDIT                          0.031498         NaN     0.748706   
...                                      ...         ...          ...   
AMT_REQ_CREDIT_BUREAU_DAY           0.000669    0.004442     0.001187   
AMT_REQ_CREDIT_BUREAU_WEEK          0.000800    0.004589     0.029471   
AMT_REQ_CREDIT_BUREAU_MON           0.003327    0.048590     0.055094   
AMT_REQ_CREDIT_BUREAU_QRT           0.001764    0.011945     0.006426   
AMT_REQ_CREDIT_BUREAU_YEAR          0.000334    0.027330     0.013669   

                            AMT_GOODS_PRICE  REGION_POPULATION_RELATIVE  \
SK_ID_CURR                         0.004222                    0.007247   
TARGET                                  NaN                         NaN   
CNT_CHILDREN                       0.001340                    0.032758   
AMT_INCOME_TOTAL                   0.030594                    0.004610   
AMT_CREDIT                         0.982987                    0.062132   
...                                     ...                         ...   
AMT_REQ_CREDIT_BUREAU_DAY          0.005247                    0.009722   
AMT_REQ_CREDIT_BUREAU_WEEK         0.006120                    0.000390   
AMT_REQ_CREDIT_BUREAU_MON          0.051123                    0.075086   
AMT_REQ_CREDIT_BUREAU_QRT          0.011643                    0.003196   
AMT_REQ_CREDIT_BUREAU_YEAR         0.029737                    0.002434   

                            DAYS_BIRTH  DAYS_EMPLOYED  ...  FLAG_DOCUMENT_18  \
SK_ID_CURR                    0.005229       0.002068  ...          0.000728   
TARGET                             NaN            NaN  ...               NaN   
CNT_CHILDREN                  0.261010       0.195754  ...          0.001622   
AMT_INCOME_TOTAL              0.003528       0.013976  ...          0.002448   
AMT_CREDIT                    0.132601       0.001101  ...          0.001719   
...                                ...            ...  ...               ...   
AMT_REQ_CREDIT_BUREAU_DAY     0.007128       0.022035  ...          0.001786   
AMT_REQ_CREDIT_BUREAU_WEEK    0.004460       0.012727  ...          0.014177   
AMT_REQ_CREDIT_BUREAU_MON     0.006219       0.031927  ...          0.004830   
AMT_REQ_CREDIT_BUREAU_QRT     0.016899       0.014287  ...          0.006451   
AMT_REQ_CREDIT_BUREAU_YEAR    0.091058       0.047222  ...          0.049998   

                            FLAG_DOCUMENT_19  FLAG_DOCUMENT_20  \
SK_ID_CURR                          0.002898          0.000217   
TARGET                                   NaN               NaN   
CNT_CHILDREN                        0.002098          0.006140   
AMT_INCOME_TOTAL                    0.000517          0.000082   
AMT_CREDIT                          0.007285          0.017112   
...                                      ...               ...   
AMT_REQ_CREDIT_BUREAU_DAY           0.001881          0.001774   
AMT_REQ_CREDIT_BUREAU_WEEK          0.004350          0.004101   
AMT_REQ_CREDIT_BUREAU_MON           0.007804          0.005118   
AMT_REQ_CREDIT_BUREAU_QRT           0.010317          0.008942   
AMT_REQ_CREDIT_BUREAU_YEAR          0.013165          0.013935   

                            FLAG_DOCUMENT_21  AMT_REQ_CREDIT_BUREAU_HOUR  \
SK_ID_CURR                          0.009651                    0.017873   
TARGET                                   NaN                         NaN   
CNT_CHILDREN                        0.004520                    0.001600   
AMT_INCOME_TOTAL                    0.000652                    0.000085   
AMT_CREDIT                          0.017567                    0.002464   
...                                      ...                         ...   
AMT_REQ_CREDIT_BUREAU_DAY           0.001659                    0.243735   
AMT_REQ_CREDIT_BUREAU_WEEK          0.003836                    0.001391   
AMT_REQ_CREDIT_BUREAU_MON           0.006882                    0.002580   
AMT_REQ_CREDIT_BUREAU_QRT           0.005870                    0.007778   
AMT_REQ_CREDIT_BUREAU_YEAR          0.001417                    0.008381   

                            AMT_REQ_CREDIT_BUREAU_DAY  \
SK_ID_CURR                                   0.013174   
TARGET                                            NaN   
CNT_CHILDREN                                 0.016030   
AMT_INCOME_TOTAL                             0.000669   
AMT_CREDIT                                   0.004442   
...                                               ...   
AMT_REQ_CREDIT_BUREAU_DAY                         NaN   
AMT_REQ_CREDIT_BUREAU_WEEK                   0.192201   
AMT_REQ_CREDIT_BUREAU_MON                    0.011443   
AMT_REQ_CREDIT_BUREAU_QRT                    0.003838   
AMT_REQ_CREDIT_BUREAU_YEAR                   0.004587   

                            AMT_REQ_CREDIT_BUREAU_WEEK  \
SK_ID_CURR                                    0.007371   
TARGET                                             NaN   
CNT_CHILDREN                                  0.012206   
AMT_INCOME_TOTAL                              0.000800   
AMT_CREDIT                                    0.004589   
...                                                ...   
AMT_REQ_CREDIT_BUREAU_DAY                     0.192201   
AMT_REQ_CREDIT_BUREAU_WEEK                         NaN   
AMT_REQ_CREDIT_BUREAU_MON                     0.010031   
AMT_REQ_CREDIT_BUREAU_QRT                     0.006023   
AMT_REQ_CREDIT_BUREAU_YEAR                    0.022764   

                            AMT_REQ_CREDIT_BUREAU_MON  \
SK_ID_CURR                                   0.013882   
TARGET                                            NaN   
CNT_CHILDREN                                 0.013042   
AMT_INCOME_TOTAL                             0.003327   
AMT_CREDIT                                   0.048590   
...                                               ...   
AMT_REQ_CREDIT_BUREAU_DAY                    0.011443   
AMT_REQ_CREDIT_BUREAU_WEEK                   0.010031   
AMT_REQ_CREDIT_BUREAU_MON                         NaN   
AMT_REQ_CREDIT_BUREAU_QRT                    0.002143   
AMT_REQ_CREDIT_BUREAU_YEAR                   0.001196   

                            AMT_REQ_CREDIT_BUREAU_QRT  \
SK_ID_CURR                                   0.004590   
TARGET                                            NaN   
CNT_CHILDREN                                 0.016741   
AMT_INCOME_TOTAL                             0.001764   
AMT_CREDIT                                   0.011945   
...                                               ...   
AMT_REQ_CREDIT_BUREAU_DAY                    0.003838   
AMT_REQ_CREDIT_BUREAU_WEEK                   0.006023   
AMT_REQ_CREDIT_BUREAU_MON                    0.002143   
AMT_REQ_CREDIT_BUREAU_QRT                         NaN   
AMT_REQ_CREDIT_BUREAU_YEAR                   0.094318   

                            AMT_REQ_CREDIT_BUREAU_YEAR  
SK_ID_CURR                                    0.008739  
TARGET                                             NaN  
CNT_CHILDREN                                  0.037146  
AMT_INCOME_TOTAL                              0.000334  
AMT_CREDIT                                    0.027330  
...                                                ...  
AMT_REQ_CREDIT_BUREAU_DAY                     0.004587  
AMT_REQ_CREDIT_BUREAU_WEEK                    0.022764  
AMT_REQ_CREDIT_BUREAU_MON                     0.001196  
AMT_REQ_CREDIT_BUREAU_QRT                     0.094318  
AMT_REQ_CREDIT_BUREAU_YEAR                         NaN  

[68 rows x 68 columns]

Analysing through box plot¶

In [163]:
# Select columns with numerical data types ('int64' and 'float64') from the DataFrame 'train'
# (note: the variable name 'train_categorical' is reused here to hold the numerical columns)
train_categorical = train.select_dtypes(include=['int64', 'float64']).columns

Task 6.2 - Univariate Analysis for numerical data¶

For univariate analysis of the numerical columns, we will plot the histogram and the distribution plot.

In [170]:
# Iterate over each numerical column (stored in 'train_categorical' above) to generate plots
for column in train_categorical:
    # Construct and print the title for current plot
    title = "Plot of " + column
    print(title)

    # Plot histograms for the distribution of the variable for both TARGET categories
    plt.hist(train_0[column], alpha=0.5, label='Target=0')  # Histogram for category '0'
    plt.hist(train_1[column], alpha=0.5, label='Target=1')  # Histogram for category '1'
    plt.legend()  # Add a legend to distinguish between categories
    plt.show()  # Display the histogram

    # Plot normalized histograms for the non-null values in both TARGET categories
    # (sns.distplot is deprecated; sns.histplot with stat='density' is the modern equivalent)
    sns.histplot(train_0[column].dropna(), label='Target=0', stat='density')  # Distribution for '0'
    sns.histplot(train_1[column].dropna(), label='Target=1', stat='density')  # Distribution for '1'
    plt.legend()  # Add a legend to distinguish between categories
    plt.show()  # Display the distribution plot

    # Placeholder for a box plot function that might be defined elsewhere
    # box_plot(train_0, train_1, column)

    # Print a separator for readability between plots
    print("------------------------------------------------------------------------")
Plot of SK_ID_CURR, CNT_CHILDREN, AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, AMT_GOODS_PRICE, REGION_POPULATION_RELATIVE, DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH, FLAG_MOBIL, FLAG_EMP_PHONE, FLAG_WORK_PHONE, FLAG_CONT_MOBILE, FLAG_PHONE, FLAG_EMAIL, CNT_FAM_MEMBERS, REGION_RATING_CLIENT, REGION_RATING_CLIENT_W_CITY, HOUR_APPR_PROCESS_START, REG_REGION_NOT_LIVE_REGION, REG_REGION_NOT_WORK_REGION, LIVE_REGION_NOT_WORK_REGION, REG_CITY_NOT_LIVE_CITY, REG_CITY_NOT_WORK_CITY, LIVE_CITY_NOT_WORK_CITY, EXT_SOURCE_2, EXT_SOURCE_3, YEARS_BEGINEXPLUATATION_AVG, FLOORSMAX_AVG, YEARS_BEGINEXPLUATATION_MODE, FLOORSMAX_MODE, YEARS_BEGINEXPLUATATION_MEDI, FLOORSMAX_MEDI, TOTALAREA_MODE, OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, OBS_60_CNT_SOCIAL_CIRCLE, DEF_60_CNT_SOCIAL_CIRCLE, DAYS_LAST_PHONE_CHANGE, FLAG_DOCUMENT_2, FLAG_DOCUMENT_3, FLAG_DOCUMENT_4, FLAG_DOCUMENT_5, FLAG_DOCUMENT_6, FLAG_DOCUMENT_7, FLAG_DOCUMENT_8, FLAG_DOCUMENT_9, FLAG_DOCUMENT_10
[One histogram and one distribution plot rendered per numerical column above]
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_11
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_12
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_13
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_14
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_15
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_16
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_17
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_18
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_19
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_20
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of FLAG_DOCUMENT_21
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_REQ_CREDIT_BUREAU_HOUR
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_REQ_CREDIT_BUREAU_DAY
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_REQ_CREDIT_BUREAU_WEEK
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_REQ_CREDIT_BUREAU_MON
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_REQ_CREDIT_BUREAU_QRT
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------
Plot of AMT_REQ_CREDIT_BUREAU_YEAR
No description has been provided for this image
No description has been provided for this image
------------------------------------------------------------------------

The columns that look different, i.e. where both the histogram and the distribution plot differ between target = 0 and target = 1, are:

  • AMT_CREDIT
  • AMT_ANNUITY
  • AMT_GOODS_PRICE
  • DAYS_BIRTH
  • HOUR_APPR_PROCESS_START
  • EXT_SOURCE_2
  • EXT_SOURCE_3
  • AMT_REQ_CREDIT_BUREAU_YEAR

Reading previous application¶

In [173]:
# Read the data from the file 'previous_application.csv' into a DataFrame
previous_application = pd.read_csv('previous_application.csv')

# Display the first few rows of the DataFrame 'previous_application'
previous_application.head()
Out[173]:
SK_ID_PREV SK_ID_CURR NAME_CONTRACT_TYPE AMT_ANNUITY AMT_APPLICATION AMT_CREDIT AMT_DOWN_PAYMENT AMT_GOODS_PRICE WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 2030495 271877 Consumer loans 1730.430 17145.0 17145.0 0.0 17145.0 SATURDAY 15 ... Connectivity 12.0 middle POS mobile with interest 365243.0 -42.0 300.0 -42.0 -37.0 0.0
1 2802425 108129 Cash loans 25188.615 607500.0 679671.0 NaN 607500.0 THURSDAY 11 ... XNA 36.0 low_action Cash X-Sell: low 365243.0 -134.0 916.0 365243.0 365243.0 1.0
2 2523466 122040 Cash loans 15060.735 112500.0 136444.5 NaN 112500.0 TUESDAY 11 ... XNA 12.0 high Cash X-Sell: high 365243.0 -271.0 59.0 365243.0 365243.0 1.0
3 2819243 176158 Cash loans 47041.335 450000.0 470790.0 NaN 450000.0 MONDAY 7 ... XNA 12.0 middle Cash X-Sell: middle 365243.0 -482.0 -152.0 -182.0 -177.0 1.0
4 1784265 202054 Cash loans 31924.395 337500.0 404055.0 NaN 337500.0 THURSDAY 9 ... XNA 24.0 high Cash Street: high NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

In [174]:
# Shape of previous application
previous_application.shape
Out[174]:
(1670214, 37)

There are duplicate 'SK_ID_CURR' values, as a person could have taken out a loan multiple times¶

In [176]:
# Count how many times each SK_ID_PREV appears in previous application
previous_application.SK_ID_PREV.value_counts()
Out[176]:
SK_ID_PREV
2418762    1
2030495    1
2802425    1
2523466    1
2819243    1
          ..
2579447    1
1715995    1
2257824    1
2330894    1
1397919    1
Name: count, Length: 1670214, dtype: int64
In [177]:
# Count how many times each SK_ID_CURR appears in previous application
previous_application.SK_ID_CURR.value_counts()
Out[177]:
SK_ID_CURR
187868    77
265681    73
173680    72
242412    68
206783    67
          ..
123460     1
353874     1
302277     1
316831     1
363167     1
Name: count, Length: 338857, dtype: int64

From the information above, the shape of the previous application dataset is (1670214, 37), and the count of unique SK_ID_PREV entries matches this, at 1670214, so SK_ID_PREV is a unique key. The count of unique SK_ID_CURR entries, however, is significantly lower, at 338857. This means SK_ID_CURR contains duplicates: multiple previous applications pertain to the same current application ID.
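
The key cardinalities above can be checked programmatically. A minimal sketch on toy data (the column names mirror the real dataset; the values are invented):

```python
import pandas as pd

# Toy stand-in for previous_application: SK_ID_PREV is unique per row,
# while SK_ID_CURR repeats for applicants with several previous loans.
df = pd.DataFrame({
    'SK_ID_PREV': [1, 2, 3, 4],
    'SK_ID_CURR': [100, 100, 200, 300],
})
assert df['SK_ID_PREV'].is_unique        # one row per previous application
print(df['SK_ID_CURR'].nunique())        # → 3 distinct current applicants
```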

Task 7 - Analysis of the Outliers¶

Plotting the numerical data against the DataFrame index and analysing whether any of the columns contain outliers.

In [181]:
# Plot each numerical column of the DataFrame 'train' against its index
for column in train.select_dtypes(include='number').columns:
    title = "Plot of " + column
    plt.scatter(train.index, train[column])
    plt.title(title)
    plt.show()
[Scatter plots of each numerical column against the DataFrame index]

The examination of outliers primarily hinges on the specifics of the business problem. Assuming the data is error-free, the decision to exclude any extreme values is dictated by business requirements. A review of various plots indicates the data reporting appears accurate. Hence, rather than eliminating these points, we will highlight the columns containing extreme values for further analysis.

Columns warranting further investigation include:

CNT_CHILDREN:

  • This denotes the number of children a client has. Instances of clients with more than 10 children are considered extreme.

AMT_INCOME_TOTAL:

  • This represents the client's income. A particular value around 1.2e8 is notably distant from the rest, suggesting it may be an outlier suitable for removal.

FLAG_MOBIL:

  • This identifies whether a client possesses a mobile phone. The rare absence of one may or may not be an outlier, depending on the business context. Other columns such as OBS_30_CNT_SOCIAL_CIRCLE, DEF_30_CNT_SOCIAL_CIRCLE, and OBS_60_CNT_SOCIAL_CIRCLE also merit attention.
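
Flagging extreme values like these can be automated with Tukey's IQR fences. A minimal sketch (the income figures are invented; only the 1.2e8 magnitude mirrors the outlier noted above):

```python
import pandas as pd

def iqr_outlier_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) Tukey fences; values outside them are flagged."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

incomes = pd.Series([25_000, 30_000, 45_000, 50_000, 120_000_000])
lo, hi = iqr_outlier_bounds(incomes)
print(incomes[(incomes < lo) | (incomes > hi)].tolist())  # → [120000000]
```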

Converting numerical data to categorical for analysis¶

In [184]:
# Plot a distribution plot for the 'AMT_ANNUITY' column after removing missing values
# (sns.distplot is deprecated in recent seaborn; sns.histplot(..., kde=True) is the modern equivalent)
sns.distplot(train['AMT_ANNUITY'].dropna())
Out[184]:
<Axes: xlabel='AMT_ANNUITY', ylabel='Density'>
No description has been provided for this image
In [185]:
def amt_annuity(x):
    # Keep missing values as NaN instead of letting them fall through to 'very high'
    if pd.isna(x):
        return np.nan
    if x <= 20000:
        return 'low'
    elif x <= 50000:
        return 'medium'
    elif x <= 100000:
        return 'high'
    else:
        return 'very high'

# Apply the 'amt_annuity' function to create a new categorical column 'amt_annuity_categorical'
train['amt_annuity_categorical'] = train['AMT_ANNUITY'].apply(amt_annuity)
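
The same binning can be done in one vectorized call with pd.cut, which also keeps NaN as NaN. A sketch using the thresholds above on toy values:

```python
import pandas as pd

s = pd.Series([15_000, 35_000, 75_000, 150_000, None])
bins = [0, 20_000, 50_000, 100_000, float('inf')]
labels = ['low', 'medium', 'high', 'very high']
print(pd.cut(s, bins=bins, labels=labels).tolist())
```
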
In [186]:
# Plot a count plot of the 'amt_annuity_categorical' column
sns.countplot(x='amt_annuity_categorical', data=train)
Out[186]:
<Axes: xlabel='amt_annuity_categorical', ylabel='count'>
No description has been provided for this image

Task 8 - Merging Datasets¶

Merging DataFrames: Train and Previous Application Based on SK_ID_CURR¶

After merging both dataframes using the SK_ID_CURR column as the key, the resulting dataframe will contain duplicate SK_ID_CURR values. This duplication is not an issue, as our objective is to explore patterns, including cases where a borrower has taken out more than one loan. Retaining these duplicates allows us to analyse the data comprehensively and identify recurring trends among borrowers with multiple loan histories.
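
The one-to-many shape of this merge can be illustrated on toy data (the IDs echo the dataset but the rows are invented):

```python
import pandas as pd

# Each current application row is repeated once per matching previous application.
train_toy = pd.DataFrame({'SK_ID_CURR': [100003, 100004], 'TARGET': [0, 0]})
prev_toy = pd.DataFrame({'SK_ID_CURR': [100003, 100003, 100004],
                         'SK_ID_PREV': [1, 2, 3]})
merged = train_toy.merge(prev_toy, on='SK_ID_CURR', how='inner')
print(len(merged))  # → 3: applicant 100003 appears twice, once per previous loan
```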

In [189]:
# Merge 'train' with 'previous_application' on 'SK_ID_CURR'
# Using an 'inner' join to retain only rows present in both DataFrames
previous_train = train.merge(previous_application, on='SK_ID_CURR', how='inner')
In [190]:
previous_train.head()
Out[190]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE_x CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT_x AMT_ANNUITY_x ... NAME_SELLER_INDUSTRY CNT_PAYMENT NAME_YIELD_GROUP PRODUCT_COMBINATION DAYS_FIRST_DRAWING DAYS_FIRST_DUE DAYS_LAST_DUE_1ST_VERSION DAYS_LAST_DUE DAYS_TERMINATION NFLAG_INSURED_ON_APPROVAL
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... Auto technology 24.0 low_normal POS other with interest 365243.0 -565.0 125.0 -25.0 -17.0 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... XNA 12.0 low_normal Cash X-Sell: low 365243.0 -716.0 -386.0 -536.0 -527.0 1.0
2 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... Furniture 6.0 middle POS industry with interest 365243.0 -797.0 -647.0 -647.0 -639.0 0.0
3 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... Consumer electronics 12.0 middle POS household with interest 365243.0 -2310.0 -1980.0 -1980.0 -1976.0 1.0
4 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... Connectivity 4.0 middle POS mobile without interest 365243.0 -784.0 -694.0 -724.0 -714.0 0.0

5 rows × 118 columns

In [191]:
# Get the names of all columns in the DataFrame 'previous_application'
previous_application.columns.values
Out[191]:
array(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
       'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT',
       'AMT_GOODS_PRICE', 'WEEKDAY_APPR_PROCESS_START',
       'HOUR_APPR_PROCESS_START', 'FLAG_LAST_APPL_PER_CONTRACT',
       'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT',
       'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED',
       'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS', 'DAYS_DECISION',
       'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_TYPE_SUITE',
       'NAME_CLIENT_TYPE', 'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO',
       'NAME_PRODUCT_TYPE', 'CHANNEL_TYPE', 'SELLERPLACE_AREA',
       'NAME_SELLER_INDUSTRY', 'CNT_PAYMENT', 'NAME_YIELD_GROUP',
       'PRODUCT_COMBINATION', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
       'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
       'NFLAG_INSURED_ON_APPROVAL'], dtype=object)

previous_application has multiple rows per SK_ID_CURR, and so does the merged dataframe¶

In [193]:
# Count the occurrences of each unique value in the 'SK_ID_CURR' column of the DataFrame 'previous_application' and display the top values
previous_application['SK_ID_CURR'].value_counts().head()
Out[193]:
SK_ID_CURR
187868    77
265681    73
173680    72
242412    68
206783    67
Name: count, dtype: int64

Segregating the dataset on Target = 0 and Target = 1¶

In [195]:
# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to '0'
train_0 = train.loc[train['TARGET'] == '0']

# Create a subset of the DataFrame 'train' containing records where the 'TARGET' column is equal to '1'
train_1 = train.loc[train['TARGET'] == '1']
In [196]:
# Create a subset of the DataFrame 'previous_train' containing records where the 'TARGET' column is equal to '0'
ptrain_0 = previous_train.loc[previous_train['TARGET'] == '0']

# Create a subset of the DataFrame 'previous_train' containing records where the 'TARGET' column is equal to '1'
ptrain_1 = previous_train.loc[previous_train['TARGET'] == '1']

Plotting data¶

In [198]:
def plotting(col, hue):

    # Create a figure for the plots with a specific size
    fig = plt.figure(figsize=(13,10))

    # Subplot 1: Pie chart showing the distribution of values in the column
    ax1 = plt.subplot(221)
    train[col].value_counts().plot.pie(autopct="%1.0f%%", ax=ax1)
    plt.title('Distribution of values for the column: '+ column)

    # Subplot 2: Bar plot displaying the distribution of values by target categories
    ax2 = plt.subplot(222)
    df = pd.DataFrame()
    df['0'] = ((train_0[col].value_counts()) / len(train_0))
    df['1'] = ((train_1[col].value_counts()) / len(train_1))
    df.plot.bar(ax=ax2)
    plt.title('Distribution of values by target category')

    # Subplot 3: Count plot showing the distribution of values for Target=0
    ax3 = plt.subplot(223)
    sns.countplot(x=col, hue=hue, data=ptrain_0, ax=ax3)
    plt.xticks(rotation=90)
    plt.title('Distribution of values for Target=0')

    # Subplot 4: Count plot showing the distribution of values for Target=1
    ax4 = plt.subplot(224)
    sns.countplot(x=col, hue=hue, data=ptrain_1, ax=ax4)
    plt.xticks(rotation=90)
    plt.title('Distribution of values for Target=1')

    # Adjust layout to prevent overlap
    fig.tight_layout()

    # Display the plots
    plt.show()

Task 9 - Bivariate Analysis¶

Here we plot only three columns, as looping over every column was too computationally heavy for a dataset of this size.

Plotting: NAME_EDUCATION_TYPE¶

In [202]:
plotting('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observation: Applicants with 'Secondary special' education take out the most loans, and most of their applications are approved.

Plotting: NAME_FAMILY_STATUS¶

In [204]:
plotting('NAME_FAMILY_STATUS','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observation: Married applicants appear more likely to repay their loans on time than single applicants, judging by the distribution of contract statuses across the two target groups.

Plotting: NAME_HOUSING_TYPE¶

In [207]:
plotting('NAME_HOUSING_TYPE','NAME_CONTRACT_STATUS')
No description has been provided for this image

Observation: There's a distinct variation among the approval statuses ("Approved, Refused, Unused, and Cancelled") concerning the "House/apartment" category.

Plotting: ORGANIZATION_TYPE¶

In [210]:
# Plot a pie chart to visualize the distribution of values in the 'ORGANIZATION_TYPE' column
col = 'ORGANIZATION_TYPE'
train[col].value_counts().plot.pie(autopct="%1.0f%%")
plt.title("Distribution of " + col)
plt.show()
No description has been provided for this image
In [211]:
value_counts = train[col].value_counts(normalize=True) * 100  # Calculate percentage of each value count

# Keep only the categories whose share exceeds 2 percent
value_counts_filtered = value_counts[value_counts > 2]

# Plot pie chart only if there are values with more than 2 percent count
if not value_counts_filtered.empty:
    value_counts_filtered.plot.pie(autopct="%1.0f%%")
    plt.title(f"Distribution of {col} (more than 2%)")
    plt.show()
No description has been provided for this image
In [212]:
# Define the column of interest
col = 'ORGANIZATION_TYPE'

# Create an empty DataFrame to hold the proportions of each category for each target group
df = pd.DataFrame()

# Calculate the proportions of each category for Target=0 and Target=1 and store them in the DataFrame
df['0'] = train_0[col].value_counts(normalize=True)  # Proportions for Target=0
df['1'] = train_1[col].value_counts(normalize=True)  # Proportions for Target=1

# Set the figure size for the bar plot
sns.set(rc={'figure.figsize':(15,5)})

# Plot a bar plot showing the proportions of each category for both Target=0 and Target=1
df.plot.bar()
Out[212]:
<Axes: xlabel='ORGANIZATION_TYPE'>
No description has been provided for this image
In [213]:
# Define the column of interest
col = 'ORGANIZATION_TYPE'
# Define the hue column for differentiation
hue = 'NAME_CONTRACT_STATUS'

# Create a count plot showing the distribution of values in 'col' column, differentiated by 'hue' column, for Target=0
sns.countplot(x=col, hue=hue, data=ptrain_0)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)
plt.show()
[Count plot of ORGANIZATION_TYPE by NAME_CONTRACT_STATUS for Target=0]
In [214]:
# Define the column of interest
col = 'ORGANIZATION_TYPE'
# Define the hue column for differentiation
hue = 'NAME_CONTRACT_STATUS'

# Create a count plot showing the distribution of values in 'col' column, differentiated by 'hue' column, for Target=1
sns.countplot(x=col, hue=hue, data=ptrain_1)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=90)
plt.show()
[Count plot of ORGANIZATION_TYPE by NAME_CONTRACT_STATUS for Target=1]

This column appears to be the most significant. Individuals associated with "Business Entity Type 3" and those who are self-employed tend to have the highest default rates. The univariate analysis of this dataset yielded more insightful results compared to the bivariate analysis.
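
The visual impression can be quantified as a default rate (the mean of TARGET) per category. A sketch on invented toy rows (on the real data, replace toy with train):

```python
import pandas as pd

toy = pd.DataFrame({
    'ORGANIZATION_TYPE': ['Business Entity Type 3', 'Self-employed',
                          'Business Entity Type 3', 'School'],
    'TARGET': [1, 1, 0, 0],
})
# Default rate per organization type, highest first
rates = toy.groupby('ORGANIZATION_TYPE')['TARGET'].mean().sort_values(ascending=False)
print(rates.index[0], rates.iloc[0])  # → Self-employed 1.0
```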

Conclusion¶

Through the series of tasks performed, we gained valuable insights into the loan applicant dataset. We successfully identified missing values, explored the distribution of variables, analyzed outliers, and examined relationships between different attributes.

This analysis provides a foundational understanding of the dataset and its potential implications for loan management and risk assessment. These insights can guide decision-making processes, such as developing predictive models or refining loan approval criteria, ultimately enhancing the effectiveness of lending practices and minimizing default risks.
